

Section: Scientific Foundations

HP-SoC simulation, verification and synthesis

Simulations at different levels of abstraction are key to the efficient design of embedded systems. These levels include the functional (and possibly distributed) validation of the application, the functional validation of a combined application/architecture model, and the validation of a heterogeneous specification of an embedded system (a specification integrating modules provided at different abstraction levels).

SoCs are increasingly complex and integrate software parts as well as specific hardware parts (IPs, Intellectual Properties). Before a SoC is obtained on silicon, the system is generally specified at several abstraction levels. Any system design flow consists in refining, more or less automatically, each model into the next, starting from a functional model down to a Register Transfer Level model. One of the biggest design challenges is the development of robust, low-cost, and fast simulation tools for system verification.

The DaRT project is concerned with the simulation, at different levels of abstraction (SystemC, VHDL), of the application/architecture co-model and of the mapping/schedule produced by the optimization phase.

Foundations

Abstraction levels and Transaction Level Modeling

Currently, Transaction Level Modeling (TLM) is used in industry to solve a variety of practical problems during the design, development, and deployment of electronic systems.

The TLM 2.0 standard appeared in the last few years. It consists in describing systems according to the specifications of the TLM abstraction levels, at which function calls simulate the behavior of the communications between architecture components.

Nowadays, this modeling style is widely used for verification, and it is starting to be used for design at many major electronic companies. Recently, many initiatives have been launched to help TLM proliferate. Several teams are working to provide designers with standard TLM APIs and guidelines, TLM platform IPs, and tool support. SystemC is the first system description language to adopt the TLM specifications; several standardization APIs have thus been proposed to the OSCI by all the major EDA and IP vendors. This standardization effort is now being generalized by the OSCI / OCP-IP TLM standardization alliance, in order to build on a common TLM API foundation. One of the most important TLM API proposals is the one from Cadence, distributed to the OSCI and OCP-IP. It is intended as a common foundation for the OSCI and OCP-IP, allowing protocol-specific APIs (e.g. AMBA, OCP) and describing a wide range of abstraction levels for fast and efficient simulations.

In order to keep our design flow coherent, we chose to use two significant simulation levels, each with specific advantages.

The main objectives of the PVT (Programmer View Timed) level are the fast verification of system functionalities and the monitoring of contentions in the interconnection network. Complementary to this level, the CABA (Cycle-Accurate Bit-Accurate) level is used to accurately estimate execution time and power consumption. At the PVT level, details related to the computation and communication resources are omitted: the software application is executed by an instruction-accurate Instruction Set Simulator, and transactions are performed through channels instead of signals. At the CABA level, hardware components are implemented at the cycle-accurate level for both the processing and communication parts; the communication protocol and arbitration strategy are specified as well. Simulation at the PVT level permits a rapid exploration of a large solution space by eliminating uninteresting regions from the DSE (Design Space Exploration) process. The solutions selected at this level are then forwarded to a new exploration at the CABA level. At each level, the exploration is based on the performance and power estimation tools we developed. Code generation at both levels needs parameter specifications for execution time, power estimation, and platform configuration; these parameters are provided at the deployment phase.
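
As an illustration of this filter-then-refine exploration, the following C++ sketch prunes candidate mappings with a fast PVT estimate before re-evaluating the survivors at the CABA level. It is a minimal sketch under our own assumptions: Config, simulatePVT, and simulateCABA are hypothetical stand-ins, not the actual Gaspard2 estimation API.

    #include <vector>

    // Hypothetical candidate solution (mapping, schedule, platform parameters).
    struct Config { int id; };
    struct PvtEstimate  { double time; double contention; };
    struct CabaEstimate { double time; double energy; };

    // Stubs standing in for the fast (PVT) and accurate (CABA) simulators.
    PvtEstimate  simulatePVT(const Config&)  { return {1.0, 0.1}; }
    CabaEstimate simulateCABA(const Config&) { return {1.2, 3.4}; }

    // Two-level DSE: cheap elimination first, accurate estimation second.
    std::vector<CabaEstimate> explore(const std::vector<Config>& space,
                                      double timeBudget) {
        std::vector<Config> survivors;
        for (const Config& c : space)
            if (simulatePVT(c).time <= timeBudget)   // fast, loosely timed pass
                survivors.push_back(c);

        std::vector<CabaEstimate> results;
        for (const Config& c : survivors)
            results.push_back(simulateCABA(c));      // slow, cycle-accurate pass
        return results;
    }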

Given all of TLM's benefits, we defined a TLM metamodel as the top-level entry point for automatic transformations towards both simulation and synthesis platforms. Our TLM metamodel contains the main concepts needed for verification and design, following the Cadence API proposal. However, since we target multi-language simulation platforms, the metamodel is completely independent of the SystemC syntax. It is mainly composed of two parts, architecture and application; this clear separation between the SW and HW parts permits easy extension and update of the metamodel.

  • The architecture part contains all the concepts necessary to describe the HW elements of systems at the TLM levels.

  • The application part is mainly composed of computation tasks. These tasks can be hierarchical and repetitive, and a set of parameters can be attached to each task in order to specify its scheduling, depending on the computation model in use.

  • This metamodel thus keeps the hierarchy and repetition of both the application and the architecture, which makes it possible to benefit from data parallelism as far as possible in the design (simulation and synthesis flows); a structural sketch is given after this list. In fact, the designer can choose to flatten hierarchies when transforming the TLM model into a simulation model, and to keep them when transforming it into a synthesis model.
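
The sketch below illustrates, with hypothetical class names that are not the actual Gaspard2 metamodel, how hierarchy and repetition can be kept in the model and flattened only when a simulation transformation asks for it:

    #include <memory>
    #include <string>
    #include <vector>

    // Hypothetical metamodel elements keeping hierarchy and repetition.
    struct Task {
        std::string name;
        int repetition = 1;                           // repetition factor
        std::vector<std::shared_ptr<Task>> children;  // hierarchical decomposition
        // scheduling parameters, depending on the computation model, would go here
    };

    // A simulation transformation may flatten the hierarchy into elementary
    // task instances, while a synthesis transformation keeps it to exploit
    // data parallelism.
    void flatten(const Task& t, std::vector<const Task*>& out) {
        for (int i = 0; i < t.repetition; ++i) {
            if (t.children.empty()) out.push_back(&t);
            for (const auto& c : t.children) flatten(*c, out);
        }
    }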

Dynamic reconfiguration - FPGA

Current FPGAs support partial dynamic reconfiguration, which allows part of the FPGA to be reconfigured on the fly, hence introducing the idea of virtual hardware. Partial reconfiguration allows the swapping of (mutually exclusive) tasks depending on user requirements and quality-of-service needs. Using such a technology makes it possible to optimize the energy consumption and area of the system. It also allows very flexible systems, adaptable to large classes of applications.

Verification

Our privileged basis for verification is the reactive synchronous domain. Over the last two decades, several formal verification technologies have been provided by a very active research community in this domain. Among the available tools, we can mention efficient compilers that go beyond usual compilers in that they address more static analysis issues. There are also various model-checkers that use both symbolic and non-symbolic representations. Some of these model-checkers offer facilities that go beyond verification by enabling the synthesis of (discrete) controllers. Finally, these synchronous technologies make it possible, in some cases, to perform a functional simulation of the described systems.

Contributions of the team

The results of the DaRT simulation package mainly concern the PVT and CABA levels. We also propose techniques to interact with IPs specified at other levels of abstraction (mainly RTL).

Co-simulation in SystemC

From the association model, the Gaspard2 environment is able to automatically produce SystemC simulation code. MDE techniques implement the transformation of the association model into the SystemC model. During this transformation, the data-parallel components are unrolled, and the data dependencies between elementary tasks become synchronization primitive calls relying on a buffered strategy.
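
To illustrate this buffered strategy, the fragment below is a minimal hand-written SystemC sketch, not code generated by Gaspard2: a data dependency between two elementary tasks becomes blocking read/write calls on a bounded sc_fifo channel.

    #include <systemc.h>

    // Producer task: writes data items; blocks when the buffer is full.
    SC_MODULE(Producer) {
        sc_fifo_out<int> out;
        void run() { for (int i = 0; i < 8; ++i) out.write(i); }
        SC_CTOR(Producer) { SC_THREAD(run); }
    };

    // Consumer task: reads data items; blocks when the buffer is empty.
    SC_MODULE(Consumer) {
        sc_fifo_in<int> in;
        void run() {
            for (int i = 0; i < 8; ++i)
                std::cout << "got " << in.read() << std::endl;
        }
        SC_CTOR(Consumer) { SC_THREAD(run); }
    };

    int sc_main(int, char*[]) {
        sc_fifo<int> buf(4);   // buffer size trades memory for decoupling
        Producer p("p");
        Consumer c("c");
        p.out(buf);
        c.in(buf);
        sc_start();
        return 0;
    }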

The SoC architecture is produced from the architecture model coupled with a ready-to-use component library. A processing module in SystemC simulates the behavior of tasks mapped to a particular processor.

Other modules contain the data-parallel structures and are able to answer any read/write request. The communications between tasks, and between tasks and memories, are simulated via SystemC communication modules. These modules produce interesting results concerning the simultaneous network conflicts and the adequacy of the network capacity for the application.

A transformation chain within Gaspard2 ensures code generation from the input model. The produced simulation code is based on the assembly of SystemC IPs. These IPs are available in the Gaspard2 library at both the TLM and CABA levels. They represent all the usual architecture components, such as processors (ARM, MIPS, etc.), memories, caches, buses, and NoCs.

Model transformation towards Pthreads

The strategy in the previous version of the Gaspard2 framework imposed a global synchronization mechanism between all the tasks of the application, which prevents optimal execution. We have investigated a new strategy to overcome this problem, based on fine-grain synchronizations between the different tasks of the modeled application. This strategy uses the pthread API: each task of the UML application model is transformed into a thread, and the data exchanges between tasks are ensured by a buffer-based strategy. The best compromise between memory usage and performance can be reached by adjusting the size of each buffer. Moreover, we designed this strategy to facilitate its use in simulation targets such as SystemC-PA. The transformation chain towards Pthreads thus replaces the global synchronization mechanism of the previous Gaspard2 version with per-buffer synchronization.
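
The sketch below illustrates the buffer-based exchange with the pthread API. It is a generic bounded buffer under our own assumptions, not the code actually generated by the Gaspard2 chain; each inter-task data dependency would get one such buffer.

    #include <pthread.h>
    #include <cstdio>

    // Bounded buffer: its capacity trades memory for parallelism.
    struct Buffer {
        static const int CAP = 4;
        int data[CAP];
        int head, count;
        pthread_mutex_t m;
        pthread_cond_t notFull, notEmpty;

        Buffer() : head(0), count(0) {
            pthread_mutex_init(&m, nullptr);
            pthread_cond_init(&notFull, nullptr);
            pthread_cond_init(&notEmpty, nullptr);
        }
        void put(int v) {                       // called by the producer task
            pthread_mutex_lock(&m);
            while (count == CAP) pthread_cond_wait(&notFull, &m);
            data[(head + count++) % CAP] = v;
            pthread_cond_signal(&notEmpty);
            pthread_mutex_unlock(&m);
        }
        int get() {                             // called by the consumer task
            pthread_mutex_lock(&m);
            while (count == 0) pthread_cond_wait(&notEmpty, &m);
            int v = data[head];
            head = (head + 1) % CAP; --count;
            pthread_cond_signal(&notFull);
            pthread_mutex_unlock(&m);
            return v;
        }
    };

    Buffer buf;  // one buffer per data dependency between two tasks

    void* producerTask(void*) { for (int i = 0; i < 8; ++i) buf.put(i); return nullptr; }
    void* consumerTask(void*) { for (int i = 0; i < 8; ++i) std::printf("%d\n", buf.get()); return nullptr; }

    int main() {
        pthread_t p, c;                         // one thread per application task
        pthread_create(&p, nullptr, producerTask, nullptr);
        pthread_create(&c, nullptr, consumerTask, nullptr);
        pthread_join(p, nullptr);
        pthread_join(c, nullptr);
        return 0;
    }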

Gaspardlib extensions

The chain towards SystemC code allows simulations at the TLM-PA level. Regarding the architecture design, the process acts as a connector between existing SystemC modules. These modules correspond to basic components such as memories, processors, and caches; they are gathered in the Gaspardlib to be included or linked at the code compilation step. On the one hand, both application and architecture IPs have been modeled using UML, so that the available components can easily be dragged and dropped into the user's model. On the other hand, we aimed at providing the most flexible design for the SystemC architecture.

The Gaspardlib gives our SystemC components a high degree of interoperability with any other SystemC architecture. Consequently, additional SystemC modules have been integrated to extend the Gaspardlib. They come from other free simulation environments: ReSP, SoCLib, and UNISIM.

Partial and Dynamic Reconfiguration (PDR) implementations

The current Gaspard2 model transformation chain to the Register Transfer Level (RTL) makes it possible to generate two key aspects of a partially dynamically reconfigurable system: the dynamically reconfigurable region, and the code for the reconfiguration manager that carries out the switch between the different configurations of this dynamic region. For this, the MARTE metamodel has been extended to integrate concepts of UML state machines and collaborations, which help create mode-automata semantics at high abstraction levels. The integration of these concepts in the extended MARTE metamodel supports the respective model-to-model transformations.

Moreover, the high-level application model has several building blocks: the elementary components, each associated with several available intellectual properties (IPs). The deployment level has also been extended to integrate the notion of "configurations", which are unique global implementations of the application functionality, each configuration comprising a different combination of IPs related to the elementary components. Using a combination of the deployment level and the introduced control semantics, the designer can change the configuration related to an application, resulting in different outcomes such as consumed FPGA resources, reconfiguration times, etc. We incorporate two model-to-model transformations in our flow. First, the UML2MARTE transformation, with integrated state-machine and configuration concepts, produces an intermediate MARTE model; this model is then converted into an RTL model by the MARTE2RTL transformation. The application model is converted into several implementations of a dynamically reconfigurable hardware accelerator, along with the source code for the configuration switch.
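
As a rough illustration of what the generated configuration-switch code does, the C++ sketch below shows a two-mode automaton driving partial reconfiguration; the mode names, bitstream names, and the loadBitstream helper are hypothetical, and the real manager code is generated for the target FPGA tool flow.

    #include <cstdio>
    #include <string>

    // Hypothetical mode automaton: each mode maps to one configuration
    // (partial bitstream) of the dynamically reconfigurable region.
    enum class Mode { LowPower, HighAccuracy };

    // Stand-in for the vendor-specific partial reconfiguration call.
    void loadBitstream(const std::string& bs) { std::printf("loading %s\n", bs.c_str()); }

    struct ReconfManager {
        Mode mode = Mode::LowPower;
        void onEvent(bool needAccuracy) {   // transition triggered by an event
            Mode next = needAccuracy ? Mode::HighAccuracy : Mode::LowPower;
            if (next != mode) {             // reconfigure only on a mode change
                loadBitstream(next == Mode::HighAccuracy ? "accurate.bit"
                                                         : "lowpower.bit");
                mode = next;
            }
        }
    };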

Finally, the design flow has been validated through the construction of a dynamically reconfigurable delay-estimation correlation module that is part of a complex anti-collision radar detection system, in collaboration with IEMN Valenciennes. The simulation results from the different configurations match an initial MATLAB result, validating these configurations. Additionally, changing the IPs related to a key elementary component of the module resulted in different reconfiguration times, demonstrating the methodology.

IP based configurable massively parallel processing SoC

We propose a methodology and a tool chain to design and build IP-based configurable massively parallel architectures. The defined architecture is named mppSoC, for massively parallel processing System on Chip. It is a SIMD architecture composed of a number of processing elements (the PEs) working in perfect synchronization. A small amount of local, private memory is attached to each PE. Every PE is potentially connected to its neighbors via a regular network. Furthermore, each PE is connected to an entry of the mpNoC, a massively parallel Network on Chip that potentially connects each PE to any other, performing efficient irregular communications. The whole system is controlled by an Array Controller Unit (ACU). Our objective is then to propose a methodology to produce FPGA implementations of the mppSoC architecture.

The whole mppSoC architecture, with its various components, is implemented following an IP-based design methodology. An implementation on an FPGA, the ALTERA Stratix II 2s180, is proposed as a proof of feasibility. The architecture consists of general IPs (processor IPs, memory IPs, etc.) and specific IPs supplied with the mppSoC system (control IPs, etc.). The specific IPs are used as glue to build the architecture. The general IPs present a defined interface which must be respected by designers who want to produce their own IPs; for this kind of IP we provide a library to ease their design. The resulting architecture is configurable and parametric. In fact, to construct an mppSoC system, we assemble IPs to generate an FPGA configuration. The designer has to make several choices. He has to determine the components of his architecture, for example whether it contains an irregular communication network with a defined interconnection router, a neighborhood network, or both. Since we propose a parametric architecture, he also has to choose some architectural parameters, such as the number of PEs, the memory size, and the topology of the neighborhood network if there is one. After fixing the architecture, the designer then chooses the basic IPs to be used, such as the processor IP, the interconnection network IP, etc. In this way, the user can choose the mppSoC configuration that best satisfies his needs.

To evaluate the proposed design methodology, we implemented architectures of different sizes with various configurations. We also tested some examples of data-parallel applications such as FIR filtering, reduction, matrix multiplication, image rotation, and 2D convolution. Through simulation results, we can choose the most appropriate mppSoC configuration with the optimal performance metrics: execution time, FPGA resources, and energy consumption. As a result, we have proposed an IP-based methodology for the construction of mppSoC systems that helps the designer choose the best configuration for a given application. It is a first step towards mppSoC architecture exploration.
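
A minimal sketch of such a configuration record is shown below; the parameter and IP names are hypothetical, chosen only to illustrate the choices listed above.

    #include <string>

    // Hypothetical mppSoC configuration: architectural parameters plus the
    // basic IPs chosen to assemble one FPGA implementation.
    struct MppSocConfig {
        int  peCount        = 16;        // number of processing elements
        int  peMemoryBytes  = 4096;      // local private memory per PE
        bool hasNeighborNet = true;      // regular neighborhood network?
        std::string neighborTopology = "2D-grid";  // e.g. ring, 2D-grid, torus
        bool hasMpNoC       = true;      // irregular communication network?
        std::string processorIP   = "someProcessorIP";
        std::string mpNoCRouterIP = "someRouterIP";
    };
    // Each configuration yields one FPGA implementation; simulation results
    // (execution time, resources, energy) guide the choice among them.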

Ongoing work aims at integrating mppSoC in a real application, such as a video processing framework. Future work will aim at improving the proposed IP-assembly methodology for constructing mppSoC systems. Our ultimate goal is to provide a complete tool to generate an mppSoC configuration, in order to help the designer perform a semi-automatic architecture exploration for a given application.

Caches in MPSoCs

In Multi-Processor System-on-Chip (MPSoC) architectures using shared memory, caches have an important impact on performance and energy consumption.

When the executed application exhibits a high degree of reference locality, caches may reduce the number of shared-memory accesses and data transfers on the interconnection network; hence, execution time and energy consumption can be greatly optimized. However, caches in MPSoC architectures raise the data-coherency problem. In this context, most existing solutions are based either on data-invalidation or on data-update protocols. These protocols do not take changes in the application behavior into account. We propose a new hybrid cache-coherency protocol that is able to dynamically adapt its functioning mode according to the application needs.
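
To make the adaptation idea concrete, the sketch below shows one plausible heuristic, which is our own illustrative assumption rather than the actual protocol: a per-line counter of remote reads between writes selects update mode for actively shared lines and invalidate mode otherwise.

    // Hypothetical per-cache-line policy selector for a hybrid protocol:
    // lines re-read by other processors soon after a write favor UPDATE;
    // rarely re-read lines favor INVALIDATE (avoids useless update traffic).
    enum class CoherencyMode { Invalidate, Update };

    struct LineStats {
        unsigned remoteReadsSinceWrite = 0;
        CoherencyMode mode = CoherencyMode::Invalidate;

        void onRemoteRead() { ++remoteReadsSinceWrite; }

        CoherencyMode onLocalWrite(unsigned threshold = 2) {
            // Enough consumers since the last write -> switch to update mode.
            mode = (remoteReadsSinceWrite >= threshold)
                       ? CoherencyMode::Update
                       : CoherencyMode::Invalidate;
            remoteReadsSinceWrite = 0;
            return mode;    // the coherence controller acts accordingly
        }
    };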

An original architecture that facilitates the implementation of this protocol in Network-on-Chip based MPSoC architectures has been proposed. The performance of the proposed protocol, in terms of speed-up factor and energy reduction, has been evaluated using a Cycle-Accurate Bit-Accurate (CABA) simulation platform. Experimental results show that, in comparison with other existing solutions, this protocol can achieve significant reductions in execution time and energy consumption.

Verification

Guaranteeing the correctness of systems is a highly important issue in the Gaspard2 design methodology; it is required at least for their validation. In order to provide the designer with the required means to cope with validation, we propose to bridge the gap between the Gaspard2 design approach and validation techniques for SoCs, using the synchronous approach and test-based techniques.

We have already defined a synchronous dataflow equational model of the Gaspard2 specification concepts. The resulting model can be used to address various correctness issues: causality analysis, which detects erroneous data dependencies in specifications (i.e., those which lead to cycles); clock synchronizability analysis, when such a system model is to be considered on a deployment platform; etc.

Starting from the simulated clock properties of an embedded system (as described previously), we analyze the system behavior. On the one hand, we verify whether the functional clock constraints specified by the designer in the application specification are met during the system execution on the considered physical resources. When these constraints are not met, the simulated clock traces can be used to reason about the system and find solutions that satisfy the constraints. For instance, this may amount to decreasing the speed of processors that compute data too fast, or increasing the speed of processors that compute data too slowly. Any such modification of processor speeds must still respect the functional constraints imposed by the designer; it appears in the simulated clock traces as new physical clock properties derived from the suitable processor frequencies. Another solution may consist in delaying the first activation of a faster processor until an adequate time to begin its execution. Such an activation delay can be seen as a way of minimizing the voltage/frequency. The team's examples have highlighted the need for better numerical verification of synchronous programs, and we are also working on improving the precision of the Signal analysis.

System Level Power Modeling

Due to the ongoing nano-miniaturization in chip production, the estimation of power consumption is becoming a critical metric in embedded system design. In current industrial and academic practice, power estimation using low-level CAD tools is still widely adopted. These low-level tools are, however, inconvenient for managing the architecture of modern complex embedded systems. System-level power estimation is considered a vital premise to cope with the critical design constraints. The keywords of our contribution are hybridization and decorrelation between abstraction levels. Hybridization is applied here at two levels: the granularity of the activities used to develop the power models on one side, and the abstraction level considered on the other side. While most studies focus on power estimation at a given abstraction level without overcoming the wall of the speed/accuracy trade-off, the idea is to build a hybrid power estimation tool that gathers different abstraction levels of the system, in order to grab exactly the relevant data depending on the step of the power estimation process. Designers build their systems by instantiating different hardware and software IPs (Intellectual Properties) from existing libraries, and the granularity of the power models used should be coherent with this design approach.

In this work, we develop a hybrid system-level power estimator for embedded systems. First, power models relying on the Functional Level Power Analysis (FLPA) methodology are developed. Secondly, we integrate the whole system into a fast simulation framework in order to obtain the system's power consumption data. The combination of these two parts yields a relatively fast and accurate power estimation. Our experimental results, performed on a concrete embedded platform, show that the obtained power estimates are within 1% of the measurements realized on the real system. We further extend the use of higher abstraction levels to speed up the estimation, with the help of multi-granularity input data and phase sampling of the application. In the end, the proposed power estimation is 21 times faster than the detailed simulation, with a marginal error of 1.5%.
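
As a rough illustration of the FLPA style of model (our own notation, not the exact models used in this work), the power of each component can be written as a static term plus a weighted sum of activity counters, the weights being obtained by regression against low-level measurements:

    P_{\text{component}} = P_{\text{static}} + \sum_i \alpha_i \, A_i ,
    \qquad
    P_{\text{system}} = \sum_{c \,\in\, \text{components}} P_c

where the A_i are per-component activities (e.g., cache misses, executed instructions, bus transactions) collected from the fast simulation, and the \alpha_i are the regressed energy costs per unit of activity.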

Energy consumption driven dynamic reconfigurable execution model

As a continuation of our work on energy consumption estimation for Systems on Chip (SoC) at the cycle-accurate level using SystemC simulation, the aim of our current work is to ensure the adaptivity of SoCs to runtime changes of some operating conditions, such as consumption constraints. This adaptivity is based on the reconfigurability of SoCs implemented on FPGAs. Here, the energy consumption estimation is no longer done during simulation, but during the execution of the application on the FPGA.

In order to be adaptive to runtime changes, the system architecture has to be changed accordingly. A possible change can be, for example, to modify the degree of parallelism or to switch to a processing algorithm that consumes less energy. The decision to reconfigure is taken after a negotiation between consumption monitors integrated in the system. These monitors are OCP-compliant, which allows them to be easily integrated and reused in different architectures, thanks to the genericity and parameterizability of this standard communication protocol.

Up to now, we have started implementing simple systems on FPGA that support dynamic reconfiguration, taking user inputs as the reconfiguration criterion. We have also implemented some interface adapters in order to facilitate the future integration of the OCP monitors in the system. As future work, we intend to integrate energy consumption as a reconfiguration criterion using these monitors, which are supposed to take reconfiguration decisions after negotiating among themselves. Therefore, we started by studying the negotiation mechanisms used in software systems such as multi-agent systems; we will adapt them to our hardware architecture on FPGA.

Partial dynamic reconfiguration

Partial dynamic reconfiguration modeling [114], [113] makes it possible to generate two key aspects of a partially dynamically reconfigurable system from high-level specifications: the dynamically reconfigurable region, and the code for the reconfiguration manager that carries out the switch between the different configurations of this dynamic region. Once these aspects are generated using the model transformations, it is possible to use commercial simulation and synthesis tools to implement dynamic reconfiguration in state-of-the-art FPGAs [114]. Currently, the intermediate model transformation chain is being updated to make use of the newly introduced intermediate metamodels and model transformations developed by the DaRT team, in order to provide a uniform design flow. Similarly, optimizations related to RTL code generation using Acceleo are also continuing.

However, the MARTE-compliant high-level specifications lack the means to express architectural details at high abstraction levels. For this reason, an initial exploratory analysis was carried out in [86] that extends the MARTE hardware concepts to cover aspects of reconfigurable architectures and introduces aspects such as power consumption in these high-level models. These works can be seen as an initial contribution to the ANR FAMOUS project.

Similarly, MARTE has recently introduced a notion of 'configurations' similar to the one introduced in [114]. These concepts permit expressing system configurations in MARTE UML models, but lack guidelines and precise semantics. An overview of these concepts was presented in [112], which highlights some of the shortcomings of the present concepts and provides an alternative, as described in [114].

Network on Chip synthesis

The study of Networks on Chip (NoC) is a research field that primarily addresses global communication in Systems-on-Chip (SoC). The selected topology and the routing algorithm play a prime role in the performance of NoC architectures. In order to handle the design complexity and meet tight time-to-market constraints, it is important to automate most of these NoC design phases. The use of MARTE in modeling such architectures may provide designers with a set of high-level concepts to quickly obtain compact and reusable models.

We therefore defined a new methodology for modeling NoC-based architectures. It aims to improve the effectiveness of the MARTE standard by clarifying some notations and extending some definitions of the standard, in order to allow the modeling of complex NoC architectures.

IP based configurable massively parallel processing SoC

Our mppSoC project proposed a methodology and a tool chain to design and build IP-based configurable massively parallel architectures. An mppSoC architecture is a SIMD architecture composed of a number of processing elements (the PEs) working in perfect synchronization. Each PE is potentially connected to its neighbors via a regular network. Furthermore, each PE is connected to an entry of the mpNoC, a massively parallel Network on Chip that performs efficient irregular communications. The whole system is controlled by an Array Controller Unit (the ACU).

The mppSoC project aims at the design and implementation of a given mppSoC architecture fitting the requirements of a given application. The mppSoC architecture model is thus configurable and parameterizable, and our chain produces FPGA implementations of it.

Our latest contributions define a model-driven generation chain integrated in the Gaspard environment. An mppSoC UML model is defined using the MARTE profile. From this model, our chain generates the corresponding synthesizable mppSoC VHDL code, which can be directly simulated or prototyped on an FPGA. Targeting the DE2-70 FPGA board, we have been able to validate several mppSoC configurations running signal processing applications [ref]. This last work concludes Mouna Baklouti's PhD thesis [ref].